Overview
circIMPACT provides a strategy to analyze circular and linear transcriptome rely on data mining and statistical learning techniques. circIMPACT package offers a tidy pipeline, starting from the provided circRNA which expression classifies samples with high and low expression of it, a comprensive transcriptome and molecular pathway analysis are performed, including visualisation, normalization, classification, clustering, Differential Expression Analysis and Gene Set Enrichment Analysis. The package accepts data presented as a matrix of raw counts and allows the inclusion of variables that occur with the experimental setting. A series of functions enables data cleaning by filtering rows, data adjustment by identify and removing the unwonted source of variation and to select the best predictors for modeling the responce variable. A learning technique is applied to build a robust classification model, and also an unsupervised analysis is carried out to gain mechanistic insights and detect the molecular pathway associated with the expression levels of the selected circRNA. Finally, a Differential Expression Analysis identyfied deregulated mRNAs and a GSEA is performed for them. circIMPACT stands for “circRNA impacted genes and pathways”. circIMPACT package version: 0.1.3
circIMPACT provides an R interface to analyze a possible impact of circRNA expression profile in gene expression.
The toolset performs functional enrichment analysis and visualization of gene lists obtained comparing samples with low circRNA expression with those with high expression.
The main tools in circIMPACT are:
marker.detection- detection of circRNAs that can stratified samples by their expression signifincantlygeneexpression- DEG analysis for each circRNA-markers defined by K-means algorithm or by circRNA expression (high vs low)
The input for any of the tools consist in count matrices: from the same samples we need the quantification of circular and linear RNAs.
Installation and loading
library(circIMPACT)
library(DESeq2)
library(data.table)
library(plyr)
library(Rtsne)
library(ggplot2)
library(ggrepel)
library(plotly)
library(ComplexHeatmap)
library(circlize)
library(viridis)
library(knitr)
library(kableExtra)
library(formattable)
library(htmltools)
library(sparkline)
library(tidyverse)
library(RColorBrewer)
library(purrr)
library(magrittr)
library(randomForest)
library(DaMiRseq)
library(webshot)
library(ggpubr)
library(tidyr)
library(factoextra)Input data
data("circularData")
data("count.matrix")
data("meta")
data("coldata.df")
## define sample/condition colors
intgroup.dt <- meta[, .(.N), by = .(sample,
condition)][order(sample),
.(sample), by = condition]
samples.per.condition <- data.frame(intgroup.dt[, .N, by = .(condition)], row.names = "condition")
n.conditions <- nrow(samples.per.condition)
hues <- circIMPACT::gg_color_hue(n.conditions)
for(i in 1:n.conditions){
n.hues <- samples.per.condition[i, "N"] + 2
col.hues <- colorRampPalette(colors = c(hues[i],
"white"))(n.hues)[1:(n.hues-2)]
intgroup.dt[condition == rownames(samples.per.condition)[i], `:=`(color = col.hues,
hue = hues[i])]
}
## create a deseq dataset object
dds.circular <- suppressMessages(DESeqDataSetFromMatrix(countData = ceiling(circularData[, coldata.df$sample[order(coldata.df$condition)]]),
colData = coldata.df[order(coldata.df$condition),],
design = ~ condition))
dds.circular <- suppressMessages(estimateSizeFactors(dds.circular))
sf <- sizeFactors(dds.circular)At first a simply filter of low count is applied to remove background noise.
data.filt <- circularData[rowSums(circularData >= 5) >= 2,]
dds.filt.expr <- suppressMessages(DESeqDataSetFromMatrix(countData = ceiling(data.filt[,coldata.df$sample[order(coldata.df$condition)]]),
colData = coldata.df[order(coldata.df$condition),],
design = ~ condition))
dds.filt.expr <- suppressMessages(estimateSizeFactors(dds.filt.expr))
sf.filt <- sizeFactors(dds.filt.expr)
circNormDeseq <- counts(dds.filt.expr, normalized = T)Selection of circRNA-markers with marker.detection
CircRNA-markers
To establish a possible impact of circRNA in gene expression using different molecular subtypes of human T-ALL we select a subset of circRNA defined as markers: at first by using k-means algorithm we separate samples in two main cluster defined by the circRNA expression, then only circRNA which DESeq adj. p-value<.1 and log fold-change>1.5 have been selected.
circIMPACT <- marker.selection(dat = circNormDeseq, dds = dds.filt.expr, sf = sf.filt, p.cutoff = 0.01, lfc.cutoff = 1, method.d = "spearman", method.c = "complete", k = 2, median = TRUE)The result list will include:
- circ.markers - a data frame with DESeq estimates for each circRNAs.
- circ.targets - a vector of circRNAs defined as possible targets.
- group.df - data.frame with a variable group that indicate the new stratification of samples by circRNAs expression.
- plot - formatted table with info. about circIMPACT expression.
Density expression of circRNA-IMPACT
circMark <- circIMPACT$circ.targetIDS[5]
circMark <-"11:33286412-33287511"
circMark_group.df <- circIMPACT$group.df[circIMPACT$group.df$circ_id==circMark,]
circMark_group.df$counts <- merge(circMark_group.df, reshape2::melt(circNormDeseq[circMark,]), by.x = "sample_id", by.y = "row.names")[,"value"]
mu <- ddply(circMark_group.df, "group", summarise, Mean=mean(counts), Median=median(counts), Variance=sd(counts))
p <- ggplot(circMark_group.df, aes(x=counts, color=group, fill=group)) +
geom_density(alpha=0.3) +
ylim(c(0,0.02)) +
geom_vline(data=mu, aes(xintercept=Median, color=group),
linetype="dashed") +
geom_text(data=mu, aes(x=Median[group=="g1"] - 1,
label=paste0("Median:", round(Median[group=="g1"], 2), "/ Variance:", round(Variance[group=="g1"], 2)), y=0.012),
colour="black", angle=90, text=element_text(size=12)) +
geom_text(data=mu, aes(x=Median[group=="g2"] - 1,
label=paste0("Median:", round(Median[group=="g2"], 3), "/ Variance:", round(as.numeric(Variance[group=="g2"]), 2)), y=0.012),
colour="black", angle=90, text=element_text(size=11)) + scale_fill_brewer(palette="Dark2") +
scale_color_brewer(palette="Dark2") +
labs(title=paste0("circHIPK3 (", circMark, ")", " counts density curve"), x = "Normalized read counts", y = "Density") +
theme_classic() +
theme(axis.text.x = element_text(size=18),
axis.text.y = element_text(size=18),
axis.title = element_text(size=20),
legend.text = element_text(size=15),
legend.title = element_text(size=15),
plot.title = element_text(size = 18),
plot.subtitle = element_text(size = 13, color = "darkslategrey"))
#> Warning: Ignoring unknown parameters: text
#> Warning: Ignoring unknown parameters: text
pClustering analysis using CircRNA-markers
Clustering analysis is performed using the previuos selected circRNA, defined as circRNA-markers. T-SNE algorithm is used to performe dimensionality reduction and k-means and hierarchical clustering methods are compared in order to identify two cluster of samples.
markers.circrnas <- circIMPACT$circ.targetIDS
mat.filt.mark <- circularData[markers.circrnas, ]
dds.filt.mark <- DESeqDataSetFromMatrix(countData = ceiling(mat.filt.mark[,coldata.df$sample]),
colData = coldata.df,
design = ~ 1)
dds.filt.vst <- varianceStabilizingTransformation(dds.filt.mark, fitType = "local", blind = F)
norm.counts.filt <- assay(dds.filt.vst)## Rtsne function may take some minutes to complete...
set.seed(9)
mydist <- dist(t(norm.counts.filt))
## t-SNE representation
# set a perplexity parameter consistent with the number of samples
tsne_data <- Rtsne(mydist, pca = F, perplexity=1, max_iter=5000)
## getting the two dimension matrix
d_tsne_1 = as.data.frame(tsne_data$Y)
rownames(d_tsne_1) <- colnames(norm.counts.filt)
## keeping original data
d_tsne_1_original=d_tsne_1
## Creating k-means clustering model, and assigning the result to the data used to create the tsne
fit_cluster_kmeans=kmeans(scale(d_tsne_1), 2)
d_tsne_1_original$cl_kmeans = factor(fit_cluster_kmeans$cluster)
## Creating hierarchical cluster model, and assigning the result to the data used to create the tsne
fit_cluster_hierarchical=hclust(dist(scale(d_tsne_1)))
## setting 2 clusters as output
d_tsne_1_original$cl_hierarchical = factor(cutree(fit_cluster_hierarchical, k=2))
# Plotting the cluster models onto t-SNE output
plot_cluster=function(data, var_cluster, palette)
{
ggplot(data, aes_string(x="V1", y="V2", color=var_cluster)) +
geom_point(size=3) +
guides(colour=guide_legend(override.aes=list(size=3))) +
geom_text_repel(aes(label = rownames(data)),
hjust = 0.5, vjust = -1) +
xlab("") + ylab("") +
ggtitle("") +
theme_light(base_size=11) +
theme(axis.text.x=element_blank(),
axis.text.y=element_blank(),
legend.direction = "horizontal",
legend.position = "bottom",
legend.box = "horizontal") +
scale_colour_brewer(palette = palette)
}
plot_k=plot_cluster(d_tsne_1_original, "cl_kmeans", "Dark2")
plot_h=plot_cluster(d_tsne_1_original, "cl_hierarchical", "Set1")
## and finally: putting the plots side by side with gridExtra lib...
library(gridExtra)
#>
#> Attaching package: 'gridExtra'
#> The following object is masked from 'package:randomForest':
#>
#> combine
#> The following object is masked from 'package:dplyr':
#>
#> combine
#> The following object is masked from 'package:Biobase':
#>
#> combine
#> The following object is masked from 'package:BiocGenerics':
#>
#> combine
grid.arrange(plot_k, plot_h, ncol=2)t-SNE dimensionality reduction representation. K-means and hierarchical clustering are compared.
CircRNA ranking
After Principal Component Analysis for each circRNAs previously selected a contribution in variance explanation of the two condition comparade will be computed in order to rank circRNA for further analysis. The results can be investigated with visual representation in the PCA plot, by using fviz_pca_biplot function in the package factoextra or with a table with measure of contribution, obtained by get_pca_var function, for each circRNAs.
pca <- prcomp(x = t(norm.counts.filt), center = T)
d <- data.frame(pca$x[rownames(coldata.df), c("PC1", "PC2")], coldata.df)
PC1.var <- summary(pca)$importance["Proportion of Variance", 1]
PC2.var <- summary(pca)$importance["Proportion of Variance", 2]
g1 <- ggplot(data = d,
mapping = aes(x = PC1, y = PC2, shape = condition)) +
geom_point(size = 5) +
coord_fixed(ratio = 1) +
xlab(paste0("PC1: ", percent(PC1.var))) +
ylab(paste0("PC2: ", percent(PC2.var))) +
scale_shape_manual("", values = c(16,18)) +
theme_classic() +
theme(legend.position = "bottom",
legend.direction = "vertical",
plot.title = element_text(hjust = .5),
text = element_text(size=20),
axis.text.x = element_text(size=20),
axis.text.y = element_text(size=20),
aspect.ratio = 1)
circRNArank = circ_contrib(matrix_pc = pca, K = 25)
ggpubr::ggarrange(ggarrange(g1, circRNArank$plot,
ncol = 2,
labels = c("PCA", "")),
nrow = 1)circ_contrib function select the top K circRNAs ranked by their contirubution in variance explanation for PC1 and PC2.
PCA biplot
library(factoextra)
res = fviz_contrib(pca, choice ="var", axes = 1, top = 25)
top25 = res$data$name[order(res$data$contrib, decreasing=T)][1:25]
circAnnotation = read.csv("/media/Data/Li/circRNA_expression_per_sample.csv", header = T)
circAnnotation$circCHR = sub(":.*", "", circAnnotation$circ_id)
circAnnotation$circstart = gsub(".*:(.*)\\-.*", "\\1", circAnnotation$circ_id)
circAnnotation$circend = sub(".*-", "\\1", circAnnotation$circ_id)
circAnnotation$circ_id <- paste0(circAnnotation$circCHR, ":", as.numeric(circAnnotation$circstart)-1, "-", circAnnotation$circend)
rownames(pca$rotation) = paste0("circ", circAnnotation$gene_names[match( rownames(norm.counts.filt), circAnnotation$circ_id)], "_", rownames(norm.counts.filt))
p = fviz_pca_biplot(pca, #select.ind = list(contrib = 5),
select.var = list(contrib = 25),
ggtheme = theme_minimal(),
title = "CircRNAs PCA loadings",
# Individuals
geom.ind = "point",
fill.ind = coldata.df$condition, col.ind = "black",
pointshape = 21, pointsize = 2,
palette = "jco",
addEllipses = FALSE,
repel = TRUE,
# Variables
alpha.var ="contrib", col.var = "contrib",
gradient.cols = "PuRd",
legend.title = list(fill = "Condition", color = "Contrib",
alpha = "Contrib")) +
labs(subtitle = "Top 25 variables with highest contribution of the event to PC1 and PC2",
caption = "") +
theme(axis.text.x = element_text(size=18),
axis.text.y = element_text(size=18),
axis.title = element_text(size=20),
legend.text = element_text(size=15),
legend.title = element_text(size=15),
plot.title = element_text(size = 18),
plot.subtitle = element_text(size = 13, color = "darkslategrey"))Heatmap using CircRNA-IMPACT
set.seed(201)
dds.filt.mark <- estimateSizeFactors(dds.filt.mark)
circNormDeseq <- counts(dds.filt.mark, normalized = T)
base_mean = log2(rowMeans(circNormDeseq)+0.001)
mat_scaled = t(apply(dt, 1, function(x) scale(x = x, center = T, scale = T)))
colnames(mat_scaled) <- colnames(dt)
cond = colData(dds.filt.expr)$condition
## choice of kmeans results as cluster of samples
clus = d_tsne_1_original$cl_kmeans
cond.colors <- unique(intgroup.dt$hue)
names(cond.colors) <- unique(intgroup.dt$condition)
ha = HeatmapAnnotation(df = data.frame(condition = cond, cluster = clus),
col = list(condition = cond.colors),
show_annotation_name = F,
annotation_legend_param = list(condition = list(nrow = 2, direction = "horizontal")))
mat.dend <- as.dendrogram(fit_cluster_hierarchical)
fit_cluster_kmeans$cluster
#> SRR5398213 SRR5398214 SRR5398215 SRR5398216 SRR5398217 SRR5398218
#> 2 2 2 1 1 1
ht <- Heatmap(mat_scaled, name = "expression",
# km = 2,
# column_km = 2,
column_order = names(fit_cluster_kmeans$cluster[order(fit_cluster_kmeans$cluster)]),
col = colorRamp2(c(-2, 0, 2), c("blue", "white", "red")),
top_annotation = ha,
# top_annotation_height = unit(4, "mm"),
clustering_distance_columns = "euclidean",
clustering_method_column = "complete",
cluster_columns = F,
clustering_distance_rows = "spearman",#"minkowski",
clustering_method_rows = "ward.D2",
cluster_rows = T,
# row_dend_side = "right",
# row_names_side = "left",
show_row_names = T,
show_column_names = F,
width = unit(9, "cm"),
show_row_dend = T,
show_column_dend = T,
# row_dend_reorder = TRUE,
row_names_gp = gpar(fontsize = 5),
heatmap_legend_param = list(direction = "horizontal")) +
Heatmap(base_mean, name = "log2(base mean)", show_row_names = F, width = unit(2, "mm"), col = inferno(255), show_column_names = F, row_names_gp = gpar(fontsize = 5), heatmap_legend_param = list(direction = "horizontal"))
draw(ht, heatmap_legend_side = "bottom", annotation_legend_side = "bottom")Gene expression analysis with circIMPACTs
DEGs
Because of these analysis is highly time consuming for this tutorial have been reported the DEGs for only the circHIPK3.
## filter out genes low expressed
min.count <- 20
min.col <- 5
filt.mat <- count.matrix[rowSums(count.matrix >= min.count) >= min.col, ]#normalized circRNAs data
circNormDeseq <- counts(dds.filt.expr, normalized = T) %>% as.data.frame()
circNormDeseq$circ_id <- rownames(circNormDeseq)
library(doParallel)
no_cores <- detectCores() - 5
registerDoParallel(cores=no_cores)
gene_mark <- foreach::foreach(i=1:25, .combine = rbind) %dopar% {
results.temp <- data.frame(geneexpression(circ_idofinterest = top25[i], circRNAs = circNormDeseq,
linearRNAs = filt.mat, colData = coldata.df, padj = 0.05,
group = circIMPACT$group.df[circIMPACT$group.df$circ_id%in%top25[i],],
covariates = NULL), circIMPACT = top25[i])
}
gene_mark_hipk3 <- data.frame(geneexpression(circ_idofinterest = "11:33286412-33287511",
circRNAs = circNormDeseq,
linearRNAs = filt.mat,
colData = coldata.df,
padj = 0.1,
group = circIMPACT$group.df[circIMPACT$group.df$circ_id%in%"11:33286412-33287511",],
covariates = NULL),
circIMPACT = "11:33286412-33287511")
gene_mark <- as.data.table(gene_mark_hipk3)
gene_mark %>% dplyr::rename("Gene" = "gene_id", "logFC" = "log2FoldChange") %>%
arrange(padj) %>%
select(circIMPACT, Gene, logFC) %>% head(20) %>%
formattable::formattable(., align = c("c","c","c"), list(
gene_id = formattable::formatter("span", style = ~ formattable::style(color = "grey", font.weight = "bold")),
circIMPACT = formattable::formatter("span", style = ~ formattable::style(color = "grey", font.weight = "bold")),
logFC = circIMPACT::color_tile3(digits = 3, n = 18, fun = "comma", palette = "PiYG")))| circIMPACT | Gene | logFC |
|---|---|---|
| 11:33286412-33287511 | FHL1 | 5.770 |
| 11:33286412-33287511 | NFASC | 3.992 |
| 11:33286412-33287511 | NTRK2 | 4.288 |
| 11:33286412-33287511 | PDE7B | 5.679 |
| 11:33286412-33287511 | PGM5 | 6.356 |
| 11:33286412-33287511 | ITIH5 | 4.937 |
| 11:33286412-33287511 | NCAM1 | 3.840 |
| 11:33286412-33287511 | EYA1 | 4.655 |
| 11:33286412-33287511 | PREX2 | 3.972 |
| 11:33286412-33287511 | HPSE2 | 4.211 |
| 11:33286412-33287511 | ZNF208 | 4.652 |
| 11:33286412-33287511 | MEG8 | 3.308 |
| 11:33286412-33287511 | ADAM33 | 3.374 |
| 11:33286412-33287511 | COL21A1 | 5.406 |
| 11:33286412-33287511 | SENP6 | 3.599 |
| 11:33286412-33287511 | AMY2B | 4.361 |
| 11:33286412-33287511 | SPATA6 | 3.033 |
| 11:33286412-33287511 | ADAMTSL3 | 3.757 |
| 11:33286412-33287511 | CFD | 4.544 |
| 11:33286412-33287511 | LMOD1 | 4.651 |
# knitr::kable(gene_mark %>% dplyr::group_by(circRNA_markers, n.degs) %>%
# dplyr::summarise(DEGs = paste(sort(gene_id),collapse=", ")),
# escape = F, align = "c", row.names = T, caption = "circRNA-DEGs assosiation") %>% kable_styling(c("striped"), full_width = T)
gene_mark[gene_mark$gene_id=="HPSE",]
#> gene_id baseMean log2FoldChange lfcSE stat pvalue padj
#> 1: HPSE 437.0168 -1.894013 0.7530104 -2.515255 0.01189462 0.06367762
#> n.degs circIMPACT
#> 1: 3232 11:33286412-33287511
head(gene_mark)
#> gene_id baseMean log2FoldChange lfcSE stat pvalue
#> 1: A2M 41583.1370 2.480957 0.5423687 4.574300 4.778147e-06
#> 2: AACS 1289.5278 -1.535410 0.3962545 -3.874808 1.067089e-04
#> 3: AASS 2560.3852 2.967515 0.4803190 6.178217 6.482976e-10
#> 4: AATBC 136.1551 -1.852042 0.8110959 -2.283382 2.240786e-02
#> 5: ABCA6 1215.2060 2.631402 0.5001882 5.260825 1.434108e-07
#> 6: ABCA7 4738.4586 -1.250943 0.4750517 -2.633278 8.456514e-03
#> padj n.degs circIMPACT
#> 1: 1.759616e-04 3232 11:33286412-33287511
#> 2: 1.907132e-03 3232 11:33286412-33287511
#> 3: 1.019357e-07 3232 11:33286412-33287511
#> 4: 9.790059e-02 3232 11:33286412-33287511
#> 5: 1.034480e-05 3232 11:33286412-33287511
#> 6: 5.008060e-02 3232 11:33286412-33287511
# Make a basic volcano plot
gene_mark_hipk3$expression = ifelse(gene_mark_hipk3$padj < 0.1 & abs(gene_mark_hipk3$log2FoldChange) >= 1.5,
ifelse(gene_mark_hipk3$log2FoldChange > 1.5 ,'Up','Down'),
'Stable')
p <- ggplot(data = gene_mark_hipk3,
aes(x = log2FoldChange,
y = -log10(padj),
colour=expression,
label = gene_id)) +
geom_point(alpha=0.4, size=3.5) +
geom_text_repel(aes(label=ifelse((abs(log2FoldChange) >= 5)&(padj<=0.01), gene_id, "")), color = "black") +
scale_color_manual(values=c("blue", "grey","red"))+
xlim(c(-6.5, 6.5)) +
geom_vline(xintercept=c(-1,1),lty=4,col="black",lwd=0.8) +
geom_hline(yintercept = 1.301,lty=4,col="black",lwd=0.8) +
labs(x="log2(fold change)",
y="-log10 (adj.p-value)",
title="") +
theme_bw()+
theme(plot.title = element_text(hjust = 0.5),
legend.position="right",
legend.title = element_blank(),
legend.text = element_text(size=20),
text = element_text(size=20),
axis.text.x = element_text(size=20),
axis.text.y = element_text(size=20))
p
#> Warning: Removed 2 rows containing missing values (geom_point).
#> Warning: Removed 2 rows containing missing values (geom_text_repel).
#> Warning: ggrepel: 1 unlabeled data points (too many overlaps). Consider
#> increasing max.overlapsmy_data = data.frame(circHIPK3 = as.vector(t(circNormDeseq["11:33286412-33287511",-7])),
HPSE = filt.mat["HPSE",])
ggscatter(my_data, x = "circHIPK3", y = "HPSE",
add = "reg.line", conf.int = TRUE,
cor.coef = TRUE, cor.method = "pearson",
xlab = "circHIPK3 expression", ylab = "QKI gene expression")
#> `geom_smooth()` using formula 'y ~ x'Classification
# library(doParallel)
library(caret)
#> Carico il pacchetto richiesto: lattice
#>
#> Attaching package: 'caret'
#> The following object is masked from 'package:purrr':
#>
#> lift
library(dplyr)
library(tidyr)
library(data.table)
# no_cores <- detectCores() - 6
# registerDoParallel(cores=no_cores)
# gene_class <- foreach::foreach(i=1:25, .combine = list) %dopar% {
#
# results.temp <- gene_class(circ_idofinterest = top25[i], circRNAs = circNormDeseq,
# linearRNAs = filt.mat, colData = coldata.df,
# group = circIMPACT$group.df[circIMPACT$group.df$circ_id%in%top25[i],],
# covariates = NULL, th.corr = 0.5)
# }
gene_class_hipk3 <- circIMPACT::gene_class(circ_idofinterest = "11:33286412-33287511", circRNAs = circNormDeseq,
linearRNAs = filt.mat,
colData = coldata.df,
group =
circIMPACT$group.df[circIMPACT$group.df$circ_id%in%"11:33286412-33287511",],
covariates = NULL,
th.corr = 0.5)
#> converting counts to integer mode
#> Note: levels of factors in the design contain characters other than
#> letters, numbers, '_' and '.'. It is recommended (but not required) to use
#> only letters, numbers, and delimiters '_' or '.', as these are safe characters
#> for column names in R. [This is a message, not an warning or error]
#> Note: levels of factors in the design contain characters other than
#> letters, numbers, '_' and '.'. It is recommended (but not required) to use
#> only letters, numbers, and delimiters '_' or '.', as these are safe characters
#> for column names in R. [This is a message, not an warning or error]
#> 14352 Genes have been discarded for classification 89 Genes remained.
#p
#ggplotly(p)VI <- importance(gene_class_hipk3$RF)
VI.mat <- as.data.frame(VI)
VI.mat <- round(VI.mat, 4)
VI.mat <- VI.mat[order(VI.mat$MeanDecreaseAccuracy, decreasing = T),]
VI.mat <- as.data.table(VI.mat)
r <- rownames(VI)
VI.mat$gene <- r
# VI.mat
VI.mat %>%
mutate_if(is.numeric, function(x) {
cell_spec(x, bold = T,
color = spec_color(x, end = 0.9),
font_size = spec_font_size(x))
}) %>%
# mutate(Species = cell_spec(
# Species, color = "white", bold = T,
# background = spec_color(1:10, end = 0.9, option = "A", direction = -1)
# )) %>%
kable(escape = F, align = "c", row.names = F, caption = "Table of selected genes used for classification of subgroups defined by circRNAs variation. For each class of response variable there is a OOB error rate of classification. In the 4th column there is the importance of the variable in the growing of the the random forest") %>%
kable_styling(c("striped"), full_width = F)| high | low | MeanDecreaseAccuracy | MeanDecreaseGini | gene |
|---|---|---|---|---|
| 2.8398 | 2.7207 | 2.7923 | 0.0377 | A2ML1 |
| 2.6088 | 2.5668 | 2.6069 | 0.0347 | ABI3BP |
| 2.4073 | 2.4569 | 2.4358 | 0.0377 | ACKR2 |
| 2.4073 | 2.3156 | 2.4211 | 0.0363 | ADAMTS19 |
| 2.4569 | 2.3636 | 2.4191 | 0.029 | ADH1B |
| 2.3636 | 2.4073 | 2.4089 | 0.045 | AF165147.1 |
| 2.3636 | 2.3636 | 2.4079 | 0.0313 | AL627171.2 |
| 2.3636 | 2.3636 | 2.4079 | 0.045 | ANKAR |
| 2.4073 | 2.3301 | 2.4079 | 0.0333 | ANKRD36C |
| 2.3636 | 2.3636 | 2.4079 | 0.045 | ANO4 |
| 2.3636 | 2.3636 | 2.3825 | 0.041 | ARPC1B |
| 2.1261 | 2.2417 | 2.2053 | 0.0353 | ASXL3 |
| 2.143 | 2.2417 | 2.2027 | 0.0413 | ATF3 |
| 2.188 | 2.188 | 2.2027 | 0.0373 | ATOH8 |
| 2.143 | 2.188 | 2.1962 | 0.0373 | ATRNL1 |
| 2.188 | 2.143 | 2.1962 | 0.0207 | BX470102.1 |
| 2.1261 | 2.188 | 2.1828 | 0.044 | C1orf210 |
| 2.188 | 2.143 | 2.1729 | 0.0293 | CCNE1 |
| 2.1153 | 2.188 | 2.1687 | 0.0367 | CDC20 |
| 2.004 | 2.004 | 2.004 | 0.0557 | CFD |
| 2.004 | 2.004 | 2.004 | 0.0447 | CHI3L1 |
| 2.004 | 1.9451 | 1.9795 | 0.033 | CHRM3 |
| 2.004 | 1.9451 | 1.9795 | 0.0333 | CKB |
| 1.9451 | 1.9008 | 1.9678 | 0.0337 | CLCN3P1 |
| 1.9451 | 1.9008 | 1.9678 | 0.0247 | CLIC2 |
| 2.004 | 1.8932 | 1.9678 | 0.0267 | COL10A1 |
| 1.8932 | 2.004 | 1.9678 | 0.036 | COL11A1 |
| 1.9451 | 1.9008 | 1.9678 | 0.022 | COL21A1 |
| 1.9451 | 1.9451 | 1.9649 | 0.022 | CREB5 |
| 2.004 | 1.9008 | 1.9649 | 0.03 | CRTAC1 |
| 2.004 | 1.9008 | 1.9649 | 0.0343 | DES |
| 2.004 | 1.9008 | 1.9649 | 0.0517 | DHCR24 |
| 1.9451 | 1.9451 | 1.9649 | 0.023 | DIRC3 |
| 1.7347 | 1.7347 | 1.7347 | 0.0353 | DOCK10 |
| 1.7347 | 1.7347 | 1.7347 | 0.0297 | DOK6 |
| 1.7347 | 1.669 | 1.7081 | 0.043 | DPY19L2 |
| 1.7347 | 1.669 | 1.7081 | 0.0233 | DTWD1 |
| 1.669 | 1.7347 | 1.7081 | 0.0403 | DUSP2 |
| 1.669 | 1.7347 | 1.7081 | 0.0343 | EBF1 |
| 1.7347 | 1.6352 | 1.7002 | 0.025 | EIF4EBP1 |
| 1.669 | 1.669 | 1.7002 | 0.0317 | EYA1 |
| 1.7347 | 1.6352 | 1.7002 | 0.0503 | FAM107A |
| 1.669 | 1.669 | 1.7002 | 0.0213 | FHL1 |
| 1.669 | 1.669 | 1.7002 | 0.0323 | FPGT_TNNI3K |
| 1.6352 | 1.669 | 1.6668 | 0.0293 | FXYD1 |
| 1.6352 | 1.669 | 1.6668 | 0.0267 | GHR |
| 1.4156 | 1.4156 | 1.4156 | 0.023 | GINS4 |
| 1.3428 | 1.3428 | 1.4156 | 0.032 | HBA2 |
| 1.4156 | 1.4156 | 1.4156 | 0.0333 | HSD17B1 |
| 1.4156 | 1.4156 | 1.4156 | 0.0317 | IFI30 |
| 1.3428 | 1.3428 | 1.4156 | 0.0253 | ITGA8 |
| 1.3428 | 1.3428 | 1.4156 | 0.0243 | ITIH5 |
| 1.4156 | 1.4156 | 1.4156 | 0.0363 | KIAA0825 |
| 1.3428 | 1.3428 | 1.4156 | 0.0157 | KIZ |
| 1.3428 | 1.3428 | 1.4156 | 0.0247 | KLC3 |
| 1.3428 | 1.3428 | 1.4156 | 0.0303 | KRT13 |
| 1.4156 | 1.3428 | 1.4014 | 0.0193 | KRT5 |
| 1.3428 | 1.4156 | 1.3881 | 0.024 | LINC00472 |
| 1.3428 | 1.4156 | 1.3881 | 0.032 | LINC00865 |
| 1.3428 | 1.4156 | 1.3881 | 0.0297 | LINC01608 |
| 1.0005 | 1.0005 | 1.0005 | 0.0253 | MAGEA3 |
| 1.0005 | 1.0005 | 1.0005 | 0.0137 | MMP11 |
| 1.0005 | 1.0005 | 1.0005 | 0.028 | MMUT |
| 1.0005 | 1.0005 | 1.0005 | 0.0187 | MYH11 |
| 1.0005 | 1.0005 | 1.0005 | 0.037 | MYH3 |
| 1.0005 | 1.0005 | 1.0005 | 0.0157 | MYOCD |
| 1.0005 | 1.0005 | 1.0005 | 0.017 | NPHP1 |
| 1.0005 | 1.0005 | 1.0005 | 0.0403 | PALM3 |
| 1.0005 | 1.0005 | 1.0005 | 0.0147 | PDE1C |
| 1.0005 | 1.0005 | 1.0005 | 0.0103 | PDE7B |
| 1.0005 | 1.0005 | 1.0005 | 0.0187 | PDZK1IP1 |
| 1.0005 | 1.0005 | 1.0005 | 0.0317 | PGM5 |
| 0 | 0 | 0 | 0.021 | PI16 |
| 0 | 0 | 0 | 0.019 | PLCD4 |
| 0 | 0 | 0 | 0.0187 | POLE2 |
| 0 | 0 | 0 | 0.014 | PRELP |
| 0 | 0 | 0 | 0.0183 | PRSS8 |
| 0 | 0 | 0 | 0.0293 | RIMS1 |
| 0 | 0 | 0 | 0.014 | S100A14 |
| 0 | 0 | 0 | 0.0267 | SCARA5 |
| 0 | 0 | 0 | 0.0277 | ST8SIA1 |
| 0 | 0 | 0 | 0.0103 | STAC |
| 0 | 0 | 0 | 0.016 | TESPA1 |
| 0 | 0 | 0 | 0.0133 | TMEM132A |
| 0 | 0 | 0 | 0.013 | TMEM256_PLSCR3 |
| 0 | 0 | 0 | 0.027 | TMEM265 |
| 0 | 0 | 0 | 0.021 | TP63 |
| -1.0005 | -1.0005 | -1.0005 | 0.0087 | ZNF100 |
| -1.3428 | -1.3428 | -1.4156 | 0.0097 | ZWINT |
Enrichment analysis
#subset gene symbol deregulated using the interesting circRNA marker as stratificator
# geneList <- gene_mark$log2FoldChange[gene_mark$circIMPACT==markers.circrnas[5]]
geneList <- gene_mark_hipk3$log2FoldChange
names(geneList) <- gene_mark_hipk3$gene_id
# order gene list by foldchange
geneList = sort(geneList, decreasing = TRUE)
# names(geneList) <- gene_mark$Gene[gene_mark$circIMPACT==markers.circrnas[5]]
geneList <- geneList[abs(geneList)>1.5]
library(gprofiler2)
gostres2 <- gost(query = names(geneList)[names(geneList)!="."],
organism = "hsapiens", ordered_query = TRUE,
multi_query = FALSE, significant = TRUE, exclude_iea = FALSE,
measure_underrepresentation = FALSE, evcodes = TRUE,
user_threshold = 0.05, correction_method = "g_SCS",
domain_scope = "annotated", custom_bg = NULL,
numeric_ns = "", sources = NULL)
p <- gostplot(gostres2, capped = FALSE, interactive = TRUE)
p